Skip to content

feat: extensible log management#3762

Merged
didier-wenzek merged 13 commits intothin-edge:mainfrom
albinsuresh:feat/extensible-config-management
Oct 2, 2025
Merged

feat: extensible log management#3762
didier-wenzek merged 13 commits intothin-edge:mainfrom
albinsuresh:feat/extensible-config-management

Conversation

@albinsuresh
Copy link
Contributor

@albinsuresh albinsuresh commented Aug 21, 2025

Proposed changes

  • Spec for extensible log management
  • Spec for extensible config management
  • Example plugins showcasing the proposal
  • Extract file log plugin for file based logs
  • Finalize the path where the log/config plugins would be installed as per the layered config spec
  • inotify support to refresh supported log list on new plugin installations and plugin config updates

TBD in a follow-up PR:

  • Extension file support for tedge-log-plugin.toml
  • Agent to respect plugins.include and plugins.exclude patterns in log plugins configs

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s. You can activate automatic signing by running just prepare-dev once)
  • I ran just format as mentioned in CODING_GUIDELINES
  • I used just check as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@albinsuresh albinsuresh force-pushed the feat/extensible-config-management branch from 94512f1 to 2db7de2 Compare August 21, 2025 06:42
@albinsuresh albinsuresh requested a review from rina23q August 21, 2025 07:37
Comment on lines +39 to +59
- The plugin directory would also have a `conf.d` directory where each plugin can store their configuration in a toml file.
These config files are used to turn type discovery on/off and to define inclusion/exclusion list when type discovery is turned on:
```docker.toml
[docker]
auto_discover=true

[docker.include]
container_name_regex = "nginx"

[docker.include]
image_name_regex = ""

[docker.exclude]
container_name_regex = "kube-*"
```
- The main config file for a plugin must be named after the plugin itself (e,g: `docker.sh` -> `docker.toml`).
Key configs like `auto_discover` can only be defined in this main file.
The inclusion/exclusion list can be defined in extension files as well,
but the table names with the type prefix (e.g: `[docker.include]` and `[docker.exclude]`) must be used in those as well.
- When new software is installed, their corresponding `include/exclude` entries can be appended to the main plugin config itself,
or created in an independent extension file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to have configuration as part of the interface. Each plugin could be free to have its own configuration or not, however we don't need to dictate the format/contents of it. Yes we could encourage people to place the plugins in a central location (e.g. /etc/tedge/plugins/logs, /etc/tedge/plugins/config/), but I wouldn't enforce it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was just trying to enforce a consistent structure to the plugin configs like the include and exclude sections. My plan was to allow the plugin devvs to define additional tables in the main plugin config to describe plugin specific settings as well. But you're right that it is not up to the agent to enforce keys like container_name_regex, unit_name etc. So, I'm wondering where to draw the line between what the agent enforces and what the plugin controls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess these settings are actually tedge-agent settings which are used as post filtering on the plugins output, so that users can restrict the selection of types without having to modify the actual plugin...I'm ok with that idea, though it wasn't that clear from reading the docs at least (but maybe that was just me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have detailed it out a bit more in 9182576

Comment on lines +83 to +93
- The `config-plugins` also has a `conf.d` sub-directory where extensions of the main config file can be created dynamically.
Each extension config can have entries as follows:
```
[[files]]
type = "collectd"
path = "/etc/collectd/collectd.conf"

[[files]]
type = "nginx"
path = "/etc/nginx/nginx.conf"
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though how do we tie the these configuration snippets to the specific configuration type handler? e.g. lets call the existing file based config handler that tedge-agent supports as the "file" handler. The configuration items need to be marked as to be handled specifically by the "file" handler and nothing else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's not described clearly in the spec. I'll update that. When a request comes for a type, the agent looks up if a plugin with the same name exists. If yes, that plugin is used to handle this config type. Else, we fall back to the existing file handler which is the inbuilt default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we're probably going to have to namespace these types...as I'm sure there will be conflicts between different plugins, e.g. "mosquitto" (file) and "mosquitto" (container)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add any additional namespace for these config plugins as there was a 1:1 mapping between types and plugins. If we make config plugins also support multiple types, then yeah, this would be required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was decided to have config plugins support multiple config types and those config types would be suffixed with ::<plugin_type> when reported as supported config types, which provides that namespaced isolation.

Comment on lines +113 to +140
[executing]
action = "builtin:config_update:executing"
on_success = "download"

[download]
action = "builtin:config_update:executing"
on_success = "prepare"

[prepare]
action = "builtin:config_update:prepare"
on_success = "apply"

[apply]
action = "builtin:config_update:apply"
on_success = "validate"

[apply]
action = "builtin:config_update:apply"
on_success = "commit"
on_error = "rollback"

[apply]
action = "builtin:config_update:apply"
on_success = "successful"

[rollback]
action = "builtin:config_update:rollback"
on_success = "failed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of just having a "file" binary which does the handling of the current file-based configuration management (similar to what we do already for the tedge-apt-plugin). The only in-built functions that we could expose as builtin functions is the actually uploading and downloading of the files (though I suspect some of that work is already done by the tedge-mapper-c8y).

Copy link
Contributor Author

@albinsuresh albinsuresh Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly how it will be done. I'd just clarify one thing, which I realise isn't communicated that well in the spec after reading the comments. The same tedge-config-plugin.toml would have multiple file entries possibly handled by different plugins named after those types. For example, a collectd config file can still be listed in the tedge-config-plugin.toml file, but can be handled by a collectd plugin. The association is done purely by the type value specified in the config file and a similarly named plugin. If a named plugin does not exist for that type, then we just use the built-in plugin.

Unlike the default file plugin proposed for log management, I wasn't initially planning of of extracting this builtin behaviour into a file plugin for config as well. The plan was to keep it with the agent itself. But, for the sake of consistency, I guess we can extract that out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was decided to have a dedicated file plugin for config management as well (as it was done for log management), as that enables that plugin to be individually executed with elevated rights while the tedge-agent still runs as tedge user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question about when to publish the lists of config/log types which are reported by each plugin.

If we have a plugin concept, when do we re-evaluate changes in the types? e.g. if a docker log plugin allow users to fetch logs from existing containers, then the user needs to know which containers are currently running in order to fetch it.

Possible solutions

  • After every "side-effect" operation (device profile, firmware, software, configuration), we execute the plugin list command? Then we rely on change detection to avoid publishing identical lists to the cloud each time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The planned contract was to get the thing that installs the new application to just update the main tedge-config-plugin.toml or tedge-config-plugin.toml files or add an entry into their corresponding conf.d directories. When there are no explicit entries to add (for plugins with auto-discovery turned on), they can even just touch those files or directories without making any real edits.

The reasoning behind this was to have a contract not just for our own managed operations like device_profile, firmware_update etc but an external process can also trigger the same, if an installation was done outside the purview of tedge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For configuration management this might work, but I'm not so sure about log management, as the log plugins generally have dynamic values (e.g. as containers are added and removed, and a container does not have any knowledge about how to collect its logs, or that it even exists, that responsibility would be for the container-aware plugin that can detect added/removed containers and knows how to fetch the logs from them).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For containers, there would still be some process on the device that spawns the new containers, right? Couldn't that perform this action? My assumption was that this "container manager" process would even know the exact container id that it just spawned and can add that entry to the config file if it wants to. Or just touch the directory when auto-discovery is turned on. Wouldn't that work? Or my assumption about this "container manager" process is wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so the interface for telling the tedge-agent which log types are there is file based, so any other process will have to add/remove files as containers are created and destroyed...though this would rely on a process always running. so if we wanted to do support a journald log provider, then it would need to be another service which is run, or does every individual service just need to create its own log entry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since new configs or logs are pretty much always added as a result of software installation, we can extend the agent's software update contract to include a refresh of the supported config list and log list as well, in addition to the software list update that happens already whenever a new software is installed. That way, even when a new config/log plugin is installed using some packages, its refresh will happen automatically.

Currently the updated software list request is triggered by the mapper. We can move that to the agent so that it. does this all-in-one refresh whenever a new software package is installed.

We can also expose a REST API from the agent that the users can invoke to trigger this all-in-one refresh explicitly. . This REST API can be used by the users or external systems installing things on the device outside the purview of thin-edge. If the REST API is heavy, we can start off by providing a file-system based API to trigger this refresh by just touching a file or directory.

A single all-in-one refresh API would be sufficient initially. Granular APIs to refresh config list alone or log list alone can be added later if there's a requirement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a plugin concept, when do we re-evaluate changes in the types? e.g. if a docker log plugin allow users to fetch logs from existing containers, then the user needs to know which containers are currently running in order to fetch it.

An option is to use the concept of signal introduced by #3761

- `get <type> <temp-log-file-path> [--from <timestamp>] [--to <timestamp>] [--filter <filter-text>]`:
Used to fetch the log for the given type within the provided log range.
The log content must be written to the temporary file passed provided by the agent.
- The plugin directory would also have a `conf.d` directory where each plugin can store their configuration in a toml file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm lost here. Why do we need this config settings? What does mean auto_discover = false? Isn't it enough to add .ignore extension or whatever to the plugin like tedge diag collect does? Why useful to have include and exclude settings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto_discover flag is useful if a plugin written by a third-party is listing types on its own (e.g: a journald plugin listing all the service units installed on the device), but prefer not to list them all in the cloud. You still need to use that plugin to fetch the logs of a few services using its get command. In that case, since you may not be able to tweak the list command of the third-party plugin. it'd be convenient for you to tell the agent to ignore its list output with this auto_discover setting and explicitly define what you're interested in, with include values.

How the comination of auto_discover and include/exclude settings work are described in this comment thread. I'll update the doc as well with these details. May be the name auto_discover isn't good enough. Something like use_list is clearer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. I understood from your comment and the other thread. I prefer use_list as it sounds clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though if "auto_discover" is disabled, we'd expect the user to explicitly define the log entry which is the mapping between the log name (e.g. "mosquitto") and the plugin type that is capable of fetching the log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. When auto_discover is off, when a new log entry has to be added, its include entry must be added to the config, most-likely via a drop-in extension config at /etc/tedge/log-plugins.d.

Comment on lines +52 to +53
- The `auto_discover` and `include`/`exclude` configs are used by the `tedge-agent`
to process the supported types listed by that plugin.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm missing is how these settings are used by tedge-agent.

Guessing:

  • The auto_discover and include/exclude configs are under a docker record => this is about a plugin named docker.
  • auto_discover=true => run the plugin with the --list argument.
  • include => list of types to be used when auto_discover=false.
  • exclude => list of types to be removed from the --list.

Am I correct? It would be good to clarify as @rina23q is also lost.

Copy link
Contributor Author

@albinsuresh albinsuresh Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm missing is how these settings are used by tedge-agent.

Guessing:

  • The auto_discover and include/exclude configs are under a docker record => this is about a plugin named docker.

Correct

  • auto_discover=true => run the plugin with the --list argument.

Correct

  • include => list of types to be used when auto_discover=false.

Correct for the auto_discover=false case. But the same include pattern can be applied as an additional filter the list output when auto_discover=true as well .

  • exclude => list of types to be removed from the --list.

Correct

Am I correct? It would be good to clarify as @rina23q is also lost.

Yes, I'll detail it out.

Comment on lines +43 to +50
[docker]
auto_discover=true

[[docker.include]]
pattern = "nginx"

[[docker.exclude]]
pattern = "kube-*"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[docker]
auto_discover=true
[[docker.include]]
pattern = "nginx"
[[docker.exclude]]
pattern = "kube-*"
[plugin.docker]
auto_discover=true
[[plugin.docker.include]]
pattern = "nginx"
[[plugin.docker.exclude]]
pattern = "kube-*"

To ease the TOML parsing.

Copy link
Contributor

@reubenmiller reubenmiller Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I'm not sure if we really need all the nested structures here, so it could be simplified to

[plugin.docker]
auto_discover = true
include_pattern = ".*tedge.*"
exclude_pattern = "^systemd-.*"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you've proposed has the limitation that include_pattern for a given plugin can only be defined in one file on a single line. Dynamically adding or removing an new include_pattern would be difficult with this structure without complex sed commands. I used the nested array structure to allow users to easily append multiple patterns to the same file as well as define include patterns across multiple extension configurations. The main motivation to support multiple file extensions was to make removals also seamless.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a use-case since that individual plugins can control files via the conf.d folder. User's should be using sed (as it is anyway a flawed / naive logic) to modify these files...previously layout of the tedge-configuration-plugin.toml was changed to the equivalent table layout [[configuration]] was done to add a full item...but that was only because we didn't support config snippets.

Comment on lines +70 to +71
- When the supported types are published over mqtt, their source plugin information is also appended to that type
in the format <log_type>::<plugin_type>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we can address the problem that users face if they have a heterogeneous fleet, and only care about getting the "mosquitto" logs regardless if it is using the file-based plugin or the systemd. By requiring the "::" suffix when using any other plugin apart from the "file" based plugin, it would result in users having to use different command depending on the device.

Is there anyway where we can try to resolve the log type without the provider...and if there is only one match, then it is used, otherwise, the first match is used...if the user wants to be more specific, then can provide the ::<plugin> info...though this would complicate the publishing of the supported types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tricky one. It sounds doable as you've proposed with some additional checks while log types from different plugins are merged. But I'm just wondering if this inconsistency where the types coming from the same plugin appear differently in the cloud, some with the suffix and the others without it, would be more confusing to the user as they wouldn't know which plugin is responsible for that type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though couldn't we just trigger an update of the list again, and then check for a match (either with or without the type)? It would help with the management of a heterogeneous fleet (which is a real benefit). But we can do this in a follow up PR.

Comment on lines +140 to +141
Instead of the `exeucting` phase of `config_update` workflow doing everything in that single step,
break it down into multiple smaller stages that corresponds to each plugin command/phase as well as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to evaluate the phase 2 proposal, as I'm not convinced that these builtin internal functions will be easy to use/document, but I believe with the PoC we might get some clarity in the right direction once we have something semi-functional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an initial approach, I would even be tempted to make a tedge (or tedge-agent) command that would call the appropriate function with information provided as arguments, as this would be more re-usable for other integrations rather than relying on this workflow specific api calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config plugin spec will be refined further in a separate PR, although there was a general consensus on not going ahead with this specific phase 2 proposal.

Copy link
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Thanks again for the nice feature that I'm sure many users will be able to benefit from.

@didier-wenzek didier-wenzek force-pushed the feat/extensible-config-management branch from ebcb0bb to 6425dcf Compare October 2, 2025 11:15
@didier-wenzek didier-wenzek added this pull request to the merge queue Oct 2, 2025
Merged via the queue into thin-edge:main with commit 5c259d5 Oct 2, 2025
51 of 52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme:troubleshooting Theme: Troubleshooting and remote control

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants